\[ \newcommand{\esp}[1]{\mathbb{E}\left(#1\right)} \newcommand{\var}[1]{\mbox{Var}\left(#1\right)} \newcommand{\deriv}[1]{\dot{#1}(t)} \newcommand{\prob}[1]{ \mathbb{P}\!(#1)} \newcommand{\eqdef}{\mathop{=}\limits^{\mathrm{def}}} \newcommand{\by}{\boldsymbol{y}} \newcommand{\bc}{\boldsymbol{c}} \newcommand{\bpsi}{\boldsymbol{\psi}} \newcommand{\pmacro}{\texttt{p}} \def\like{{\cal L}} \def\llike{{\cal LL}} \def\logit{{\rm logit}} \def\probit{{\rm probit}} \def\one{{\rm 1\!I}} \def\iid{\mathop{\sim}_{\rm i.i.d.}} \def\res{e} \newcommand{\argmin}[1]{{\rm arg}\min_{#1}} \newcommand{\argmax}[1]{{\rm arg}\max_{#1}} \]
Preliminary comments:
library(tidyverse)
## ── Attaching packages ─────────────────────────────────────────────────────────────────────────────────────────────────────────── tidyverse 1.2.1 ──
## ✔ ggplot2 3.1.0 ✔ purrr 0.2.5
## ✔ tibble 1.4.2 ✔ dplyr 0.7.7
## ✔ tidyr 0.8.2 ✔ stringr 1.3.1
## ✔ readr 1.1.1 ✔ forcats 0.3.0
## ── Conflicts ────────────────────────────────────────────────────────────────────────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
library(ggplot2)
The file `salesData/sales1.csv` contains the sales data; we read and plot it:
don <- read.csv("salesData/sales1.csv")
p1 <- don %>%
ggplot() +
aes(x = time, y = y) +
geom_point() +
geom_line() +
xlab("time") +
ylab("percentage of sales") +
ggtitle("Sales throughout time") +
theme_bw()
p1
After plotting the data we do not recognize any higher-order polynomial trend; the series actually looks affine, so we can summarize it with a straight line. We will hence fit a degree-1 polynomial to the data:
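In formulas, this is the simple linear model

\[ y_i = \beta_0 + \beta_1 t_i + e_i, \qquad e_i \iid \mathcal{N}(0, \sigma^2). \]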
lm1 <- lm(y ~ time, data = don)
pred1 <- predict(lm1)
don <- data.frame(don,pred1)
summary(lm1)
##
## Call:
## lm(formula = y ~ time, data = don)
##
## Residuals:
## Min 1Q Median 3Q Max
## -3.6994 -1.5777 -0.3596 1.6444 3.4747
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 99.99688 0.94971 105.29 < 2e-16 ***
## time 0.37985 0.02643 14.37 1.17e-11 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.2 on 19 degrees of freedom
## Multiple R-squared: 0.9158, Adjusted R-squared: 0.9113
## F-statistic: 206.5 on 1 and 19 DF, p-value: 1.166e-11
p2 <- p1 +
geom_line(data = don, aes(x = time, y = pred1), color = "red") +
theme_bw()
p2
par(mfrow=c(2,2))
plot(lm1)
The time variable is statistically significant according to its t-value, as is the intercept. From the F-statistic's p-value we see that adding the variable time is better than an intercept-only model, and the adjusted R^2 is quite high. The diagnostic plots support the Gaussian linear model hypotheses (equal variance of residuals, normal distribution of residuals). Leverage and influence seem acceptable from a Cook's distance point of view.
Here time is measured in months, so we can try to add a yearly periodicity with period \(T = 12\). We will use periodic terms such as \(\cos(2\pi t/T)\) and \(\sin(2\pi t/T)\) and add them to our model:
## Using cos:

\[ y_i = \beta_0 + \beta_1 t_i + \beta_2 \cos(2\pi t_i / T) + e_i \]
lm2 <- lm(y ~ time + (cos((2*pi*time)/12)), data = don)
pred2 <- predict(lm2)
don <- data.frame(don,pred2)
p3 <- p1 +
geom_line(data = don, aes(x = time, y = pred2), color = "red") +
theme_bw()
p3
summary(lm2)
##
## Call:
## lm(formula = y ~ time + (cos((2 * pi * time)/12)), data = don)
##
## Residuals:
## Min 1Q Median 3Q Max
## -3.1087 -1.3212 -0.1863 1.5072 3.4641
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 100.03019 0.87687 114.076 < 2e-16 ***
## time 0.37706 0.02444 15.430 8.03e-12 ***
## cos((2 * pi * time)/12) 1.28801 0.62150 2.072 0.0529 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.031 on 18 degrees of freedom
## Multiple R-squared: 0.932, Adjusted R-squared: 0.9244
## F-statistic: 123.3 on 2 and 18 DF, p-value: 3.116e-11
par(mfrow=c(2,2))
plot(lm2)
The new periodic term is only marginally significant, falling just above the sacrosanct 0.05 level (p = 0.0529) according to its t-value; all the other coefficients remain highly significant. From the F-statistic's p-value we see that the model with this cosine periodic term is better than an intercept-only model, and the adjusted R^2 is higher than for the previous model. Diagnostic plots support the Gaussian linear model hypotheses (equal variance of residuals, normal distribution of residuals). Leverage and influence seem to have improved compared to the previous model.
## Using sin:

\[ y_i = \beta_0 + \beta_1 t_i + \beta_2 \sin(2\pi t_i / T) + e_i \]
lm3 <- lm(y ~ time + (sin((2*pi*time)/12)), data = don)
pred3 <- predict(lm3)
don <- data.frame(don,pred3)
p4 <- p1 +
geom_line(data = don, aes(x = time, y = pred3), color = "red") +
theme_bw()
p4
summary(lm3)
##
## Call:
## lm(formula = y ~ time + (sin((2 * pi * time)/12)), data = don)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2.6552 -0.6945 0.4861 0.6453 2.1662
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 99.6598 0.5896 169.04 < 2e-16 ***
## time 0.3889 0.0164 23.71 5.02e-15 ***
## sin((2 * pi * time)/12) 2.4069 0.4267 5.64 2.37e-05 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.359 on 18 degrees of freedom
## Multiple R-squared: 0.9696, Adjusted R-squared: 0.9662
## F-statistic: 286.6 on 2 and 18 DF, p-value: 2.246e-14
par(mfrow=c(2,2))
plot(lm3)
The new sine periodic term is statistically significant according to its t-value, as are the intercept and time. From the F-statistic's p-value we see that this model is better than an intercept-only model, and the adjusted R^2 is higher than in all our previous models. Diagnostic plots support the Gaussian linear model hypotheses (equal variance of residuals, normal distribution of residuals), although the residuals are not quite as evenly distributed as before; they nevertheless remain acceptable. Leverage and influence seem to have improved compared to the degree-1 polynomial model.
Although each periodic term brings some improvement on its own, we can try using both in the model to see if they yield something better:
\[ y_i = \beta_0 + \beta_1 t_i + \beta_2 \sin(2\pi t_i / T) + \beta_3 \cos(2\pi t_i / T) + e_i \]
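Using both terms amounts to fitting a yearly sinusoid with a free phase, since any linear combination of a sine and a cosine of the same frequency \(\omega = 2\pi/T\) can be rewritten as a single shifted sine:

\[ \beta_2 \sin(\omega t) + \beta_3 \cos(\omega t) = R \sin(\omega t + \varphi), \qquad R = \sqrt{\beta_2^2 + \beta_3^2}, \quad \tan\varphi = \beta_3 / \beta_2. \]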
lm4 <- lm(y ~ time + (sin((2*pi*time)/12)) + (cos((2*pi*time)/12)), data = don)
pred4 <- predict(lm4)
don <- data.frame(don,pred4)
p5 <- p1 +
geom_line(data = don, aes(x = time, y = pred4), color = "red") +
theme_bw()
p5
summary(lm4)
##
## Call:
## lm(formula = y ~ time + (sin((2 * pi * time)/12)) + (cos((2 *
## pi * time)/12)), data = don)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.7557 -0.4608 0.1007 0.8127 1.2649
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 99.69815 0.44996 221.571 < 2e-16 ***
## time 0.38611 0.01254 30.797 2.36e-16 ***
## sin((2 * pi * time)/12) 2.35218 0.32593 7.217 1.44e-06 ***
## cos((2 * pi * time)/12) 1.18481 0.31757 3.731 0.00166 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.037 on 17 degrees of freedom
## Multiple R-squared: 0.9833, Adjusted R-squared: 0.9803
## F-statistic: 332.9 on 3 and 17 DF, p-value: 2.716e-15
par(mfrow=c(2,2))
plot(lm4)
All terms are statistically significant and the adjusted R^2 has increased further. We can also see an improvement in the equal variance and independence of the residuals compared to the previous model.
To decide which model is best so far, we can compare all four using the AIC and BIC criteria:
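Recall that for a model with \(k\) estimated parameters, maximized log-likelihood \(\llike\) and \(n\) observations,

\[ \mathrm{AIC} = -2\,\llike + 2k, \qquad \mathrm{BIC} = -2\,\llike + k \log n, \]

so lower values are better, and BIC penalizes extra parameters more heavily.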
BIC(lm1,lm2,lm3,lm4)
## df BIC
## lm1 3 99.74920
## lm2 4 98.30006
## lm3 4 81.41808
## lm4 5 71.90108
AIC(lm1,lm2,lm3,lm4)
## df AIC
## lm1 3 96.61563
## lm2 4 94.12197
## lm3 4 77.23999
## lm4 5 66.67847
It seems that the most adequate model according to both criteria is the one with both the sine and cosine periodic terms.
p5
par(mfrow=c(2,2))
plot(lm4)
Again, as mentioned before, the terms in our model are individually statistically significant according to their t-tests, the adjusted R^2 is the highest among our models, and both the AIC and BIC criteria support this choice. Concerning the residuals, despite a slight deviation from the equal-variance and independence assumptions of the Gaussian linear model, the model remains valid. So the model seems to fit our data correctly, but I'm afraid of overfitting.
We now get rid of the free intercept and force it to equal 100: \(y_i = f(t_i) + e_i\) with

\[ f(t) = 100 + \beta_1 t + \beta_2 \sin(2\pi t / 12) + \beta_3 \left( \cos(2\pi t / 12) - 1 \right). \]

Both periodic terms vanish at \(t = 0\), so \(f(0) = 100\).
lm5 <- lm(y -100 ~ -1 + time + I(sin((2*pi*time)/12)) + I(cos((2*pi*time)/12)-1), data = don)
pred5 <- predict(lm5, don) +100
don <- data.frame(don,pred5)
p6 <- don %>%
  ggplot() +
  aes(x = time, y = y) +
  geom_point() +
  geom_line(aes(x = time, y = pred5), color = "red") +
  theme_bw()
p6
summary(lm5)
##
## Call:
## lm(formula = y - 100 ~ -1 + time + I(sin((2 * pi * time)/12)) +
## I(cos((2 * pi * time)/12) - 1), data = don)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.9858 -0.4548 0.2311 0.8582 1.7732
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## time 0.400663 0.008894 45.049 < 2e-16 ***
## I(sin((2 * pi * time)/12)) 2.408199 0.337386 7.138 1.19e-06 ***
## I(cos((2 * pi * time)/12) - 1) 0.888120 0.267237 3.323 0.00378 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.08 on 18 degrees of freedom
## Multiple R-squared: 0.9948, Adjusted R-squared: 0.9939
## F-statistic: 1138 on 3 and 18 DF, p-value: < 2.2e-16
par(mfrow=c(2,2))
plot(lm5)
To check that the intercept is indeed set to 100:
NEW_PT <- data.frame(time = 0)
predict(lm5,NEW_PT) + 100
## 1
## 100
The file `salesData/sales30.csv` contains quarterly sales for 30 different products:
don2 <- read.csv("salesData/sales30.csv")
p1 <- don2 %>%
ggplot() +
aes(x = time, y = y, color = as.factor(id)) +
geom_point() +
xlab("time") +
ylab("sales") +
ggtitle("quarterly sales for 30 different products")
p1
lm6 <- lm(y ~ time + (sin((2*pi*time)/12)) + (cos((2*pi*time)/12)),
data = don2)
pred6 <- predict(lm6)
don2 <- data.frame(don2,pred6)
p2 <- p1 +
geom_line(data = don2, aes(x = time, y = pred6)) +
facet_wrap(~id)
p2
summary(lm6)
##
## Call:
## lm(formula = y ~ time + (sin((2 * pi * time)/12)) + (cos((2 *
## pi * time)/12)), data = don2)
##
## Residuals:
## Min 1Q Median 3Q Max
## -19.7030 -2.9607 0.1117 3.3645 17.4116
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 100.09336 0.42471 235.676 < 2e-16 ***
## time 0.21655 0.01183 18.299 < 2e-16 ***
## sin((2 * pi * time)/12) 1.24348 0.30764 4.042 5.96e-05 ***
## cos((2 * pi * time)/12) 0.89834 0.29975 2.997 0.00283 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 5.36 on 626 degrees of freedom
## Multiple R-squared: 0.3634, Adjusted R-squared: 0.3603
## F-statistic: 119.1 on 3 and 626 DF, p-value: < 2.2e-16
par(mfrow = c(2,2))
plot(lm6)
Here we do not take into account any product-specific randomness (e.g. in the intercept), which explains why the fit is sometimes well above or well below the actual data points.
library(lme4)
don2 <- don2 %>%
mutate(cos_per = cos((2*pi*time)/12)) %>%
mutate(sin_per = sin((2*pi*time)/12))
There are various ways to introduce randomness into our model; we will proceed sequentially.
This was our first intuition when looking at the plot above
lmr1 <- lmer(y ~ (1|id) + time + sin_per + cos_per, data = don2)
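In `lmer` syntax, the term `(1|id)` adds a product-specific random intercept, i.e.

\[ y_{i,t} = (\beta_0 + b_{0i}) + \beta_1 t_{i,t} + \beta_2 \sin(2\pi t_{i,t}/12) + \beta_3 \cos(2\pi t_{i,t}/12) + e_{i,t}, \qquad b_{0i} \iid \mathcal{N}(0, \omega_0^2), \]

where \(\omega_0^2\) denotes the between-product variance of the intercept.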
Looking closer, there could also be randomness on the slope!
lmr2 <- lmer(y ~ (time|id) + sin_per + cos_per, data = don2)
lmr3 <- lmer(y ~ (time|id) +
sin_per +
cos_per +
(cos_per + time|id),
data = don2)
lmr4 <- lmer(y ~ time + sin_per + (time + sin_per | id) + cos_per, data = don2)
lmr5 <- lmer(y ~ (time|id) +
cos_per +
(time + cos_per | id) +
sin_per,
data = don2)
lmr6 <- lmer(y ~ time +
cos_per +
(time + cos_per | id) +
sin_per,
data = don2)
lmr7 <- lmer(y ~ (time|id) +
cos_per +
(time + cos_per | id) +
(time + sin_per | id),
data = don2)
lmr8 <- lmer(y ~ time +
cos_per +
(time + cos_per | id) +
(time + sin_per | id),
data = don2)
Now, to compare all these models, let's use the AIC and BIC criteria:
AIC(lmr1,lmr2,lmr3,lmr4,lmr5,lmr6,lmr7,lmr8)
## df AIC
## lmr1 6 3457.001
## lmr2 7 3021.690
## lmr3 13 3032.737
## lmr4 11 2156.560
## lmr5 13 3032.970
## lmr6 11 2997.041
## lmr7 18 2210.082
## lmr8 16 2168.809
BIC(lmr1,lmr2,lmr3,lmr4,lmr5,lmr6,lmr7,lmr8)
## df BIC
## lmr1 6 3483.675
## lmr2 7 3052.810
## lmr3 13 3090.531
## lmr4 11 2205.463
## lmr5 13 3090.764
## lmr6 11 3045.944
## lmr7 18 2290.105
## lmr8 16 2239.940
There are many other possible combinations, but at first glance the model with random effects on time and on the sine periodic term (lmr4) really stands out. We can check whether removing the random intercept can further improve this best model:
lmr9 <- lmer(y ~ time + sin_per + ( -1 + time + sin_per | id) + cos_per, data = don2)
BIC(lmr9,lmr4)
## df BIC
## lmr9 8 2186.126
## lmr4 11 2205.463
AIC(lmr9,lmr4)
## df AIC
## lmr9 8 2150.56
## lmr4 11 2156.56
We will therefore consider the following model:
\[ y_{i,t} = \beta_0 + \beta_1^{(i)} t_{i,t} + \beta_2^{(i)} \sin(2\pi t_{i,t}/12) + \beta_3 \cos(2\pi t_{i,t}/12) + e_{i,t}, \]

where \(\beta_1^{(i)}\) and \(\beta_2^{(i)}\) include product-specific random effects.
pred9 <- predict(lmr9)
don2 <- data.frame(don2,pred9)
p9 <- p1 +
geom_line(data = don2, aes(x = time, y = pred9)) +
facet_wrap(~id)
p9
Indeed the models seem to fit better than our first try.
We use the same model as before taking into account this constraint and the different ids:
\[ y_{i,t} = 100 + \beta_1^{(i)} t_{i,t} + \beta_2^{(i)} \sin(2\pi t_{i,t}/12) + \beta_3 \left( \cos(2\pi t_{i,t}/12) - 1 \right) + e_{i,t} \]
lmr10 <- lmer(y -100 ~ -1 +
time +
I(sin((2*pi*time)/12)) +
(-1 + time + I(sin((2*pi*time)/12)) || id) +
I(cos((2*pi*time)/12)-1),
data = don2)
pred10 <- predict(lmr10,don2) + 100
don2 <- data.frame(don2,pred10)
p10 <- p1 +
geom_line(data = don2, aes(x = time, y=pred10, color = as.factor(id))) +
theme_bw()
p10
The file `salesData/salesNew.csv` contains data for a new product:
The final model of part 2 will be used here. In other words, you should not use the new data to fit any new model.
don3 <- read.csv("salesData/salesNew.csv")
lmr <- lmer(y ~ time + cos_per + sin_per + (-1 + time + sin_per || id), don2)
Since we have no observations for this new product, we can only predict using the fixed effects \(\hat\beta\) estimated by our previous model.
t <- seq(0,66,0.2)
X <- cbind(1,t,cos(2*pi*t/12)-1, sin(2*pi*t/12))
beta <- fixef(lmr)
pred <- data.frame(time=t, pred0=X %*% beta)
p13 <- ggplot(don3, aes(x = time, y = y)) +
geom_point() +
xlab("time") +
ylab("sales") +
geom_line(data = pred, aes(time, pred0), color = "red")
p13
The results aren't astonishing: the curve is clearly off compared to our data points!
Suppose now that only the first data at time 1 is available for this product. Compute and plot the new predictions.
Repeat the same process with an increasing number of observed data. Comment the results.
I'll answer these two questions together.
Observing some data for this product gives us information beyond the population-level (a priori) distribution: we can use Bayesian estimators, which exploit the conditional distribution of the random effects given the observed data to obtain a more accurate, product-specific estimate. Here we use the MAP estimator seen in class.
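Writing \(\Gamma\) for the covariance matrix of the random effects, \(\sigma^2\) for the residual variance, and \(X_j\), \(A_j\) for the fixed- and random-effects design matrices restricted to the first \(j\) observations \(y_j\), the MAP estimate of the random effects computed below is

\[ \hat{b} = \left( \frac{A_j^\top A_j}{\sigma^2} + \Gamma^{-1} \right)^{-1} \frac{A_j^\top \left( y_j - X_j \hat\beta \right)}{\sigma^2}, \]

which is then added to the population fixed effects \(\hat\beta\) on time and on the sine term to obtain product-specific coefficients.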
summary(lmr)
## Linear mixed model fit by REML ['lmerMod']
## Formula: y ~ time + cos_per + sin_per + ((0 + time | id) + (0 + sin_per |
## id))
## Data: don2
##
## REML criterion at convergence: 2137.2
##
## Scaled residuals:
## Min 1Q Median 3Q Max
## -3.04054 -0.60956 0.02221 0.60432 3.06362
##
## Random effects:
## Groups Name Variance Std.Dev.
## id time 0.01878 0.137
## id.1 sin_per 8.42646 2.903
## Residual 1.03600 1.018
## Number of obs: 630, groups: id, 30
##
## Fixed effects:
## Estimate Std. Error t value
## (Intercept) 100.09336 0.08064 1241.178
## time 0.21655 0.02512 8.621
## cos_per 0.89834 0.05692 15.784
## sin_per 1.24348 0.53319 2.332
##
## Correlation of Fixed Effects:
## (Intr) time cos_pr
## time -0.077
## cos_per 0.023 -0.005
## sin_per -0.011 0.001 -0.005
# Estimated covariance of the random effects (variances from summary(lmr))
Sig <- diag(c(0.01878, 8.42646))
# Its inverse, used in the MAP formula
sig <- solve(Sig)
# Residual variance
res <- 1.036
t <- seq(1,61,3)
pts <- seq(1,20,3)
for (j in pts){
  betaj <- beta
  yj <- don3$y[1:j]  # first j observations of the new product
  tj <- t[1:j]
  # Fixed- and random-effects design matrices at the observed times
  Xj <- cbind(1, tj, cos(2*pi*tj/12)-1, sin(2*pi*tj/12))
  Aj <- cbind(tj, sin(2*pi*tj/12))
  # MAP estimate of the random effects given the j observations
  coefj <- solve(t(Aj) %*% Aj/res + sig)
  coeffj <- coefj %*% t(Aj) %*% (yj - Xj %*% beta)/res
  # Product-specific coefficients = fixed effects + estimated random effects
  betaj["time"] <- beta["time"] + coeffj[1]
  betaj["sin_per"] <- beta["sin_per"] + coeffj[2]
  pred$predj <- X %*% betaj
  print(p13 +
    geom_line(data = pred, aes(x = time, y = predj), color = "green"))
}
We can see the fit improving as we incorporate more data points into the estimation, which was to be expected.